ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech
Koniaris, Marios, Tsipi, Argyro, Tsanakas, Panayiotis
Parliamentary speech generation presents specific challenges for large language models (LLMs) beyond standard text generation tasks. Unlike general text generation, parliamentary speeches require not only linguistic quality but also political authenticity and ideological consistency. Current language models lack specialized training for parliamentary contexts, and existing evaluation methods focus on standard NLP metrics rather than political authenticity. To address this, we present ParliaBench, a benchmark for parliamentary speech generation. We constructed a dataset of speeches from the UK Parliament to enable systematic model training. We introduce an evaluation framework that combines computational metrics with LLM-as-a-judge assessments to measure generation quality across three dimensions: linguistic quality, semantic coherence, and political authenticity. We propose two novel embedding-based metrics, Political Spectrum Alignment and Party Alignment, to quantify ideological positioning. We fine-tuned five LLMs, generated 28k speeches, and evaluated them using our framework, comparing baseline and fine-tuned models. Results show that fine-tuning produces statistically significant improvements across the majority of metrics, and our novel metrics demonstrate strong discriminative power for political dimensions.
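The abstract does not detail how the two alignment metrics are computed, so the following is a minimal sketch of one plausible embedding-based construction: score a generated speech by its cosine similarity to the mean embedding of a party's reference speeches. The embedding model and the centroid construction here are illustrative assumptions, not the authors' exact method.

```python
# Minimal sketch of an embedding-based Party Alignment score. The embedding
# model and the centroid construction are illustrative assumptions, not the
# paper's exact method.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def party_centroids(reference_speeches: dict[str, list[str]]) -> dict[str, np.ndarray]:
    """Mean embedding of each party's reference speeches."""
    return {party: model.encode(texts).mean(axis=0)
            for party, texts in reference_speeches.items()}

def party_alignment(generated: str, centroids: dict[str, np.ndarray], party: str) -> float:
    """Cosine similarity between a generated speech and its target party's centroid."""
    v = model.encode([generated])[0]
    c = centroids[party]
    return float(np.dot(v, c) / (np.linalg.norm(v) * np.linalg.norm(c)))
```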
Towards Synthesizing Normative Data for Cognitive Assessments Using Generative Multimodal Large Language Models
Yan, Victoria, Chotkowski, Honor, Wang, Fengran, Li, Xinhui, Yang, Carl, Lu, Jiaying, Yan, Runze, Hu, Xiao, Fedorov, Alex
Cognitive assessments require normative data as essential benchmarks for evaluating individual performance. Hence, developing new cognitive tests based on novel image stimuli is challenging due to the lack of readily available normative data. Traditional data collection methods are costly, time-consuming, and infrequently updated, limiting their practical utility. Recent advancements in generative multimodal large language models (MLLMs) offer a new approach to generating synthetic normative data from existing cognitive test images. We investigated the feasibility of using MLLMs, specifically GPT-4o and GPT-4o-mini, to synthesize normative textual responses for established image-based cognitive assessments, such as the "Cookie Theft" picture description task. Two distinct prompting strategies were evaluated: naive prompts with basic instructions and advanced prompts enriched with contextual guidance. Responses were analyzed using embeddings to assess their capacity to distinguish diagnostic groups and demographic variations. Performance metrics included BLEU, ROUGE, BERTScore, and an LLM-as-a-judge evaluation. Advanced prompting strategies produced synthetic responses that more effectively distinguished between diagnostic groups and captured demographic diversity compared to naive prompts. The more capable models generated responses exhibiting higher realism and diversity. BERTScore emerged as the most reliable metric for contextual similarity assessment, while BLEU was less effective for evaluating creative outputs. The LLM-as-a-judge approach provided promising preliminary validation results. Our study demonstrates that generative multimodal LLMs, guided by refined prompting methods, can feasibly generate robust synthetic normative data for existing cognitive tests, thereby laying the groundwork for developing novel image-based cognitive assessments without the traditional limitations.
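Of the metrics listed, BERTScore is reported as the most reliable for contextual similarity; a minimal usage sketch against a single human reference follows. The bert-score API is real, but the example sentences are invented for illustration.

```python
# Illustrative scoring of a synthetic "Cookie Theft" description against a
# human reference response; the data below is made up for the example.
from bert_score import score

synthetic = ["A boy on a stool reaches for a cookie jar while the sink overflows."]
references = ["A child is taking cookies from a jar as water spills over the sink."]

P, R, F1 = score(synthetic, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")  # higher = closer to the human norm
```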
Verbalized Confidence Triggers Self-Verification: Emergent Behavior Without Explicit Reasoning Supervision
Jang, Chaeyun, Choi, Moonseok, Kim, Yegon, Lee, Hyungi, Lee, Juho
Uncertainty calibration is essential for the safe deployment of large language models (LLMs), particularly when users rely on verbalized confidence estimates. While prior work has focused on classifiers or short-form generation, confidence calibration for chain-of-thought (CoT) reasoning remains largely unexplored. Surprisingly, we find that supervised fine-tuning with scalar confidence labels alone suffices to elicit self-verification behavior in language models, without any explicit reasoning supervision or reinforcement learning-based rewards. Despite being trained only to produce a verbalized confidence score without any self-verifying examples, the model learns to generate longer and self-checking responses for low-confidence queries while providing more concise answers for high-confidence ones. We further propose a simple rethinking method that boosts performance via test-time scaling based on calibrated uncertainty. Experiments on GSM8K and held-out reasoning tasks such as MATH-500 and ARC-Challenge show that our confidence-aware fine-tuning improves both calibration and accuracy, while also enhancing interpretability by aligning the model's reasoning path with its confidence.
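The rethinking method is only summarized in the abstract; the toy sketch below shows the general pattern of confidence-gated re-generation. The `generate` stub, the threshold, and the retry prompt are all illustrative assumptions, not the paper's exact procedure.

```python
# Toy sketch of confidence-gated "rethinking" at test time. `generate` is a
# hypothetical stand-in for the fine-tuned model; the threshold and retry
# prompt are illustrative assumptions.
import random

def generate(question: str, prompt_suffix: str = "") -> tuple[str, float]:
    """Stand-in: query the model, parse its answer and verbalized confidence."""
    confidence = random.uniform(0.4, 1.0)  # a real call would parse this from the output
    return f"answer to {question!r}", confidence

def answer_with_rethinking(question: str, threshold: float = 0.7, max_retries: int = 2) -> str:
    answer, confidence = generate(question)
    retries = 0
    # Spend extra test-time compute only on low-confidence queries.
    while confidence < threshold and retries < max_retries:
        answer, confidence = generate(question, "\nDouble-check the reasoning step by step.")
        retries += 1
    return answer

print(answer_with_rethinking("What is 17 * 24?"))
```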
MojoBench: Language Modeling and Benchmarks for Mojo
Raihan, Nishat, Santos, Joanna C. S., Zampieri, Marcos
The Mojo programming language (PL), recently introduced by Modular, has received significant attention in the scientific community due to its claimed substantial speed boost over Python. Despite advancements in code Large Language Models (LLMs) across various PLs, Mojo remains unexplored in this context. To address this gap, we introduce MojoBench, the first framework for Mojo code generation. MojoBench includes HumanEval-Mojo, a benchmark dataset designed for evaluating code LLMs on Mojo, and Mojo-Coder, the first LLM pretrained and finetuned for Mojo code generation, which supports instructions in 5 natural languages (NLs). Our results show that Mojo-Coder achieves a 30-35% performance improvement over leading models like GPT-4o and Claude-3.5-Sonnet. Furthermore, we provide insights into LLM behavior with underrepresented and unseen PLs, offering potential strategies for enhancing model adaptability. MojoBench contributes to our understanding of LLM capabilities and limitations in emerging programming paradigms, fostering more robust code generation systems.
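HumanEval-style benchmarks conventionally score functional correctness with pass@k; the standard unbiased estimator from the original HumanEval work is sketched below. This is the generic formula, not necessarily the exact harness MojoBench uses.

```python
# Standard unbiased pass@k estimator used by HumanEval-style benchmarks:
# given n samples per problem of which c pass the tests, estimate the
# probability that at least one of k drawn samples is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n samples per problem, c of them correct: estimate P(>=1 correct in k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))  # 0.25
```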
Exploring the Readiness of Prominent Small Language Models for the Democratization of Financial Literacy
Kosireddy, Tagore Rao, Wall, Jeffrey D., Lucas, Evan
The use of small language models (SLMs), herein defined as models with fewer than three billion parameters, is increasing across various domains and applications. Due to their ability to run on more accessible hardware and preserve user privacy, SLMs have the potential to democratize access to language models for individuals of different socioeconomic status and with different privacy preferences. This study assesses several state-of-the-art SLMs (e.g., Apple's OpenELM, Microsoft's Phi, Google's Gemma, and the TinyLlama project) for use in the financial domain to support the development of financial literacy LMs. Democratizing access to quality financial information for those who are financially undereducated is greatly needed in society, particularly as new financial markets and products emerge and participation in financial markets increases due to ease of access. We are the first to examine the use of open-source SLMs to democratize access to financial question answering capabilities for individuals and students. To this end, we provide an analysis of the memory usage, inference time, similarity to ground-truth answers, and output readability of prominent SLMs to determine which models are most accessible and capable of supporting access to financial information. We analyze zero-shot and few-shot learning variants of the models. The results suggest that some off-the-shelf SLMs merit further exploration and fine-tuning to prepare them for individual use, while others may have limits to their democratization.
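As a rough illustration of the kind of profiling described (inference time, readability), the sketch below times one generation and scores its Flesch reading ease. The model id, question, and library choices are examples, not necessarily the study's exact setup.

```python
# Illustrative accessibility profiling for one SLM: wall-clock inference time
# and output readability. The model id and question are example choices.
import time
import textstat
from transformers import pipeline

generator = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

question = "What is compound interest and why does it matter for retirement savings?"
start = time.perf_counter()
output = generator(question, max_new_tokens=128)[0]["generated_text"]
latency = time.perf_counter() - start

print(f"inference time: {latency:.2f}s")
print(f"Flesch reading ease: {textstat.flesch_reading_ease(output):.1f}")
```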
Dialogue Chain-of-Thought Distillation for Commonsense-aware Conversational Agents
Chae, Hyungjoo, Song, Yongho, Ong, Kai Tzu-iunn, Kwon, Taeyoon, Kim, Minjin, Yu, Youngjae, Lee, Dongha, Kang, Dongyeop, Yeo, Jinyoung
Human-like chatbots necessitate the use of commonsense reasoning in order to effectively comprehend and respond to implicit information present within conversations. Achieving such coherence and informativeness in responses, however, is a non-trivial task. Even for large language models (LLMs), identifying and aggregating key evidence in a single hop presents a substantial challenge, because such evidence is scattered across multiple turns of a conversation and must therefore be integrated over multiple hops. Hence, our focus is to facilitate such multi-hop reasoning over a dialogue context, namely dialogue chain-of-thought (CoT) reasoning. To this end, we propose a knowledge distillation framework that leverages LLMs as unreliable teachers and selectively distills consistent and helpful rationales via alignment filters. We further present DOCTOR, a DialOgue Chain-of-ThOught Reasoner that provides reliable CoT rationales for response generation. We conduct extensive experiments to show that enhancing dialogue agents with high-quality rationales from DOCTOR significantly improves the quality of their responses.
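The alignment filters are described only at a high level; the sketch below shows the general shape of such filtering, with hypothetical stand-in scoring functions in place of DOCTOR's actual consistency and helpfulness checks.

```python
# Simplified sketch of rationale filtering for distillation. The two criteria
# mirror the paper's description (consistency, helpfulness), but both scoring
# functions are hypothetical stand-ins, not DOCTOR's actual filters.
def consistency_score(dialogue: list[str], rationale: str) -> float:
    """Stand-in: fraction of rationale words grounded in some dialogue turn."""
    mentioned = sum(any(w in turn for turn in dialogue) for w in rationale.split())
    return mentioned / max(len(rationale.split()), 1)

def helpfulness_gain(rationale: str, gold_response: str) -> float:
    """Stand-in: token overlap with the gold response; the paper instead checks
    whether conditioning on the rationale helps produce the gold response."""
    r, g = set(rationale.split()), set(gold_response.split())
    return len(r & g) / max(len(g), 1)

def keep_rationale(dialogue, rationale, gold_response,
                   min_consistency=0.5, min_gain=0.0) -> bool:
    return (consistency_score(dialogue, rationale) >= min_consistency
            and helpfulness_gain(rationale, gold_response) > min_gain)

dialogue = ["I lost my dog last week.", "He was a golden retriever."]
rationale = "The speaker is grieving because the dog was lost last week."
gold = "I am so sorry your dog was lost."
print(keep_rationale(dialogue, rationale, gold, min_consistency=0.4))  # True
```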
The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python
Miceli-Barone, Antonio Valerio, Barez, Fazl, Konstas, Ioannis, Cohen, Shay B.
Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming. Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers. We show that LLMs not only fail to properly generate correct Python code when default function names are swapped, but some of them even become more confident in their incorrect predictions as the model size increases, an instance of the recently discovered phenomenon of Inverse Scaling, which runs contrary to the commonly observed trend of increasing prediction quality with increasing model size. Our findings indicate that, despite their astonishing typical-case performance, LLMs still lack a deep, abstract understanding of the content they manipulate, making them unsuitable for tasks that statistically deviate from their training data, and that mere scaling is not enough to achieve such capability.
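The perturbation at the heart of the paper is easy to reproduce: the snippet below swaps two Python builtins, after which superficially idiomatic completions are wrong unless the model tracks the new bindings. The concrete builtin pair chosen here is an illustrative example.

```python
# The paper's core perturbation in miniature: swap two builtin names, and the
# "obvious" completion becomes wrong unless the new bindings are tracked.
len, print = print, len  # rebind: `len` now writes to stdout, `print` now counts

len("hello world")    # prints: hello world
n = print([1, 2, 3])  # n == 3, the length of the list
assert n == 3
```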
Diversifying Task-oriented Dialogue Response Generation with Prototype Guided Paraphrasing
Lippe, Phillip, Ren, Pengjie, Haned, Hinda, Voorn, Bart, de Rijke, Maarten
Existing methods for Dialogue Response Generation (DRG) in Task-oriented Dialogue Systems (TDSs) can be grouped into two categories: template-based and corpus-based. The former prepare a collection of response templates in advance and fill the slots with system actions to produce system responses at runtime. The latter generate system responses token by token by taking system actions into account. While template-based DRG provides high precision and highly predictable responses, it usually falls short in generating diverse and natural responses compared to (neural) corpus-based approaches. Conversely, while corpus-based DRG methods are able to generate natural responses, their precision and predictability cannot be guaranteed. Moreover, the diversity of responses produced by today's corpus-based DRG methods is still limited. We propose to combine the merits of template-based and corpus-based DRG by introducing a prototype-based paraphrasing neural network, called P2-Net, which aims to enhance the quality of responses in terms of both precision and diversity. Instead of generating a response from scratch, P2-Net generates system responses by paraphrasing template-based responses. To guarantee the precision of responses, P2-Net learns to separate a response into its semantics, context influence, and paraphrasing noise, and to keep the semantics unchanged during paraphrasing. To introduce diversity, P2-Net randomly samples previous conversational utterances as prototypes, from which the model can extract speaking style information. We conduct extensive experiments on the MultiWOZ dataset with both automatic and human evaluations. The results show that P2-Net achieves a significant improvement in diversity while preserving the semantics of responses.
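A data-flow sketch of the described pipeline follows: fill a template with system actions, sample a past utterance as a style prototype, then paraphrase. The `paraphrase` stub stands in for the trained P2-Net model; the template and dialogue history below are invented for illustration.

```python
# Data-flow sketch of prototype-guided paraphrasing: template filling, style
# prototype sampling, then paraphrasing. `paraphrase` is a stand-in for the
# trained P2-Net model; the template and history are invented examples.
import random

def fill_template(template: str, actions: dict[str, str]) -> str:
    """Fill [slot] placeholders with system action values."""
    for slot, value in actions.items():
        template = template.replace(f"[{slot}]", value)
    return template

def paraphrase(response: str, prototype: str) -> str:
    """Stand-in for P2-Net: rewrite `response` in the speaking style of
    `prototype` while keeping its semantics (slots/values) unchanged."""
    return response  # identity placeholder

history = ["Sure thing, happy to help!", "No problem at all."]
template = "There are [count] [food] restaurants in the [area]."
actions = {"count": "three", "food": "Italian", "area": "city centre"}

base = fill_template(template, actions)
styled = paraphrase(base, prototype=random.choice(history))
print(styled)
```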